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ABSTB&CT 



Over the past several years benchmarking has beer, devel- 
oped into an effective technique for performance analyses of 
computer systems. Relational database machines are rela- 
tively new compter systems for which a benchmarking tech- 
nique does not yet exist. 

Ihe benchmarking of relational database machines 
involves the indent ification and design of test programs 
through which relevant performance data can be gathered and 
interpreted. All features of relational database management 
must be considered when designing these test programs. The 
join operations are an important feature of relational data- 
base management. 

The test programs for the join operations necessarily 
include the repetition of certain queries during which 
specific join parameters are varied. These parameters 
include: tuple size, relation size, disk placement, and the 

use cf indices. A number of join operations have been 
benchmarked. These operations are equality joins, 

inequality joins, three-way joins, and virtual joins (i.e. , 
views) . In addition, a number of relarional database 
machine configurations have been utilized for benchmarking 
the jcin cperations. 

The highlights of the thesis can be found in ins contri- 
buticn tc a benchmarking technique for the join operations 
and its conclusions on the performance analyses of various 
relational machines in operating joins. 
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INTRODUCTION 



A. SHAT IS BENCHMARKING? 

The term "benchmark" has irs origin in the field of 
geographical surveying. A benchmark is a permanent 

geographic feature which serves as a landmark for surveying. 
The term has evolved into defining a standard or criterion 
associated with a particular type of system or product. 
This standard serves as a poinr of reference to which func- 
tionally similar systems or producrs can be compared. 

In the realm of computer science a benchmark consists of 
a standard set of instructions or programs. The execution of 
the set cn cne system provides measurements that can be used 
to ccmpare with measurements obtained by running the same 
set on another system. This is the essence of computer 
system benchmarking: the process of conducting controlled 

experiments to collect indicators of comparative performance 
cf different computer sysrams. 

B. THE "GIBSON MIX" 

Ccmparisons of computer systems were prompted by the 
increasing application of the systems in business and other 
situations in a cost-effective way. This interest in compa- 
rative performance of systems had resulted in the controlled 
experiments cf the systems. In 1970, J.C. Gibson introduced 
a system of programs sets or "mixes" by which variable types 
cf workloads could be compared. The "Gibson Mix" approach 
to comparing systems is based on testing several se+s of 
applications in both business and science. The results, 
execution times, cf these tests were published. The 
problem of selecting a particular computer system could be 
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reduced tc establishing workloads as mulriples of the aix. 
By properly balancing execution times and mix multiples, 
system evaluators could produce comparative estimates for 
total computer systems. 



C. BENCHHAEK DESIGN AND OBJECTIVES 



Benchmarking as a technique for comparisons of computer 
performance has enjoyed increasing popularity over the past 
decade. This approach is appealing both to producers and 
consumers. Basic guidelines have been developed for the 
proper use of benchmarks. The benchmark must be representa- 
tive cf real-world wcrkloads, and the mix of instructions 
should be inclusive enough to provide as much relevant data 
as possible. Additionally, the relevance of benchmark 
content must be justifiable. The benchmark should be care- 
fully designed, and objectives should be specifically stated 
so that the proper sequence of steps in the benchmark 
progression can be set down. Objectives may include evalua- 
tion towards procurement, design analysis, component certi- 
fication, quality determinations, load analysis, improvement 
of performance, or ether objectives as determined by these 
requiring the benchmark. The benchmark should be tailored 
to the objective and deal with those demands or applications 
which initially formed the basis of and the requirement for 
the benchmarking. The benchmark must be controlled from 
design through implementation and throughout the interpreta- 
tion of the results. 
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II. THE BENCHMAR KING ENVIR ONM ENT 



Ths “xperiments described in this paper have been 
conducted on several configurations of an RDM 1100 at the 
Data Processing Service Center West, Naval Air Station, 
Point Mugu, California. The RDM 1100 and its various 
configurations are relational database machines, each of 
which is designed to be the backend of UNIVAC 1100 series 
computers, 

A. TEE HOST COMPOTEH 

The hcst computer system of which a relational database 
machine is used as the backend is the UNIVAC 1100/42. No 
modifications have been required of the UNIVAC operating 
sysrem. Specially designed host-resident software has been 
installed in the UNIVAC. 

B. THE HCST COMPUTEB/DATAB ASE MACHINE INTERFACE 

Figure 2.1 depicts the presently available methods for 
interfacing between the host and the backend. The first 
merhcd, the relational query language, is a command inter- 
face. The second method allows the user to execute a series 
of queries by referring to a set of stored commands. The 
third method is via user programs written in high-level 
programming languages such as COBOL and FORTRAN in which a 
subroutine is provided for accessing data stored in the 
backend machine. 

In the RDM, host interfacing is accomplished by both 
parallel and serial interface modules (processors) (see 
Figure 2.2) . Each interface module can support up to 8 host 
systems . 
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Figure 2. 1 The Host/Baclcend Interface. 



13 




Figure 2.2 The Database Machine Architecture. 
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THE EENCHHABKED RELATIONAL DATABASE MACHINE 



The basic relational database machine on which the 
benchmarking experiments have been conducted is a modularly 
designed, microprocessor- based darabase computer. The 
modules are organized around a single high-speed bus (see 
Figure 2.2 again). 

^ • Technol ogy and Funct ional i ry ot Modul es 

a. The Database Processor 

This-Z8000 series microprocessor controls the 
flow cf dara by translating user queries into procedures. 
Additionally, this processor supervises system resources, 
coordinates hardware monitoring, and performs bus arbitra- 
tion. The processor contains approximately 99% of C-codes 
and operates at 1/2 MIP. If the database accelerator 
(described below) is available, the database processor 
senses its availability and issues calls for its services. 

b. The Accelerator 



This high-speed, a 
executes instructions at 10 MIPS 
It has a three-stage pipeline and 
well-defined collection cf often 
subroutines. Ihe accelerator 
transfer rates. 



uxilary processor which 
is built from ECL logic, 
is designed to optimize a 
used database management 
can filter data at disk 



0 . The Cache 

This main memory is composed of 64K dynamic ram 
chips and is expandable up to 6 megabytes. System informa- 
tion and cods occupy approximately 360K of this memory. 
Cache is allocated in 2K blocks, contiguously whenever 
possible. The paging algorithm is basically 

Laast-Recently-Used , and the system code is never paged out. 
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d. Disk Drives and rhs Secondary Storage 

The disk controller moiale performs burst error 
detection and correction and retry without intervention by 
the database processor. This controller can manage from one 
to four disk drives with each drive having a capacity of one 
to four disks. Presently there are two disks available with 
each disk capable of storing approximately 600 megabytes of 
data . 

2. D if f er e nt Ac cel era tor and Cache con fi gur a tion s 

Tested 

The benchmarking experiments have been conducted on 
the following different machine configurations; 

a. 1/2-megabyte cache without the database 
accelerator 

b. 2-megabytes cache with the database accelerator 

0 . 2-megabytes cache without the database 
acce Isratcr 

D. THE DATABASES 

The relational database machine handles data in 2K byte 
blocks. With this in mind a syntesized database has been 
designed. Tuple (record) lengths of 100 bytes, 200 bytes, 
1000 bytes, and 2000 bytes have been chosen, thereby 
providing a range of 1 to 20 tuples per block. It has been 
sought through experimentation to contrast rhe same opera- 
tions performed on relations with different numbers of 
tuples per block. It is felt that this approach may provide 
some measure of processor-overhead time versus I/O time. 
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• D atabase Gen erat io n 

Standard templates for each of the four different 
tuple lengths have been designed. Table I describes these 
templates. Note that within each template there are attri- 
butes (fields) that are common to all four templates: 
sequential integers and random integers. The attributes of 
sequential and random integers can be used to enforce 
different orderings of the same data. Each template also 
contains attributes specified with values uniformly distri- 
buted over a number of enumerated values. 3y ensuring 
specified distribution, the reliability of equality joins 
can be assured. 

The actual relations for the experimental databases 
have teen generated cn an IBM 3033 system in batch mode and 
have been transferred to tape for transport to the UNIvac 
system. For each of the four tuple lengths, relations have 
been generated with 500, 1000, 2500, 5000, and 10000 tuples. 

2* D at a ba s e Cre atio n, Loadi ng, and Disk Plac em ent 

In the environment cf the database machine, the 
number cf 2K-byte blocks assigned ro a darabase is specified 
with the CREATE DATABASE command in the query language 
(Section II. E further describes the query language). Since 
database allocations are made in the whole number of cylin- 
ders, the number of blocks specified will be rounded up to 
the first whcle number of cylinders. Once the allocation is 
mads, the number of blocks actually allocated is returned to 
the user. The syntax for database creation in the query 
language is: 



CREATE DATABASE (name) WITH (options) 
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options : 

demand - the number of blocks to allocate 
disk - the disk on which allocation is desired 



For example, 

CREATE DATABASE NPSTEST with demand = 1000 on "DSK001”, 

demand = 2 000 on '‘DSKSY3'' 



would set aside 1000 blocks on the disk named "DSK001” and 
2000 blccks on the disk named "DSKSYS” for the database 
"NPSTEST*'. 



Once the database 
relations in that database 



has been assigned disk 
may be created as follows: 



space , 



CREATE relation-name ((field name) = (format),..,, 
(field name) = (format)) 



The above command would set up an empty relation in the 
database to which tuples could then be appended. A database 
is opened by simplying entering: "OPEN (database name) ". 

In order tc bulkload records into relations in 
specified databases, utility programs have been provided. 
The experimental relations that have been generated on the 
IBM 3033 system and subsequently loaded into the ONIVAC 
system have been translated into the backend machine using 
these utility programs. 

Initially, we have attempted to manipulate the 
placement of relations in a database. That is, once a data- 
base has been allocated with disk space by the CREATE 
command, w<= have tried to force a specific placement of a 
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relation on a particular disk. Wa have assumed that for a 
join, optimization can be achieved if the relations tc be 
joined are physically located on different disks. However, 
cur attempts at placement have proven futile. The designers 
of the backend machine utilized certain placement algo- 
rithms. These algorithms are proprietary and are, there- 
fore, unavailable for our modification. 

The query language for the machine allows the crea- 
tion of indices for quicker data access. The creation of 
these indices and their use is described in the following 
section. 

3 . Indices 

Simply stated, indices are designed to provide more 
direct access to stored data. The query language for the 
relational database machine allows for the creation of two 
different types of indices. A "clustered” index is one for 
which the tuple is physically in the order of -^he value in 
the specified field. A "nonclustered" index is one that is 
created for a field or group of fields for which the tuple 
is not clustered. 

Note that in NPSTEST all of the relations have beer- 
created with clustered indices. Also, as they are described 
below, indices for certain relations in other experimental 
databases may be created, destroyed, and then recreated 
during the course of the run stream for a particular join 
experiment. 

4 . The Ex peri me n tal Da tabas es 

Table II describes the experimental databases. As 
they are explained more fully below in individual experiment 
descriptions, the size of the databases, the number of rela- 
tions in the databases, and the indices employed are all 
factors in the measurements obtained. 
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E. THE COERY LANGOAGE FOB THE DATABASE MACHINE 

Incorporated as an integral part of the relational data- 
base system is the query language (RQL in the case of the 
RDM 1100). This query language is designed to be both a 
definition language and manipulation language for the data 
stored in the machine. 

1 . S emant ics and Svnt a x 

The use of the CREATE command for both databases 
and relations has previously been discussed. The following 
discussion seeks to describe those features of the query 
language that are essential to an understanding of the 
nature of experiments that have been conducted on the join 
operation. 

a. BEGIN (transaction name) 

This command is used whenever multiple RQL ccmuiands are to 
be treated as a single command. 

b. END (transaction name) 

This command is used at the 'end of the group of RQL commands 
under BEGIN. 

c. CREATE VIEW (view name) 

This command is used to set up a virtual relation within a 
database . 

d. DEFINE (stored command name) 

This command is used to define a stored command for a parti- 
cular database. The command so defined can be referenced 
simply by its name. 
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e. DESTROY (object name) 

ccnunand is used zo eliminate databases, relations, 
css, views, stored commands, or other constructs from 
system . 

f. RANGE of (range variable) is (relation name) 

e variables are used to allow the user tc establish a 
nym for a relation name. Once this synonym is estab- 
ed it can be used in lieu of rhs relation name. 

g. RETRIEVE (target list) MHERE (qua lif rear ion) 

is the most essential command for performing jcin oper- 
ns. Relations or portions of relations are pulled from 
age and displayed for the user. The data retrieved 
nds on the user supplied gualif icarion which may include 
ular or multiple equalities and inequalities. Op to 22 
ds can be specified in the targer lisr. 

h. RETRIEVE (variable name = GETTIME () ) 

IBE is a function in RQL that allows the user to 
ieve a time statement from the RDM clock. The time 
ger retrieved is in 1/60 seconds. As will be described 
w, we used these times for our compuxarions . 

2 • Exp erimen tal Qu eries 

Queries have been designed utilizing those features 
RQL described above. The query streams have been 

gned as sets of transactions, and the joins have been 
gned as stored commands so that the commands could be 
parsed in order that parsing time would be eliminated 
the join time measurements. The number of fields 
eted for a jcin is described below in individual experi- 
descrip-ions. 
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III. THE BENCH M&BKING 



A. GCALS 

The experiments described in this paper are directed 
towards the development of a procedure by which database 
machines may be benchmarked. The efforts described here 
represent cnly a pcrtion of the research. Interested 
readers are directed to [Ref. 1], [Ref. 2], and [Ref. 3] for 
additional research cn selection and projection, database 
administration, and database generation. 

Ihe goal of these experiments is not to make a defini- 
tive pronouncement cn the performance of the various 
configurations of the RDM 1100. Rather, the goal is to 
learn how to design benchmarks and interpret the results of 
the benchmarking experiments. Towards this end, the method- 
ology must be machine independent, and the workload model 
must be based on a mix of database management statements. 

B. THE MITHODOLOGT 

The workload has been modeled as a collection of queries 
in rhe relational query language (RQL) . The primary bench- 
mark kernel for the join operations is the RETRIEVE state- 
ment with associated qualifications. In designing this 
workload, classes of queries have been identified. These 
include data-int ensive and overhead-intensive classes. The 
workload has been constructed as a combination of queries 
from each class. The query language has functioned as the 
primary tool for performance measurement since neither 
sorftware nor hardware probes have been available for use in 
conducting these experiments. Using the functions provided 
in the query language, elapsed times are measureable from 
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the database machine clock in seconds. Since the goal of 
these experiments is to learn the effects of varying parame- 
ters on machine performance and not absolute machine perfor- 
mance, this "rough" measurement technique is acceptable. 

The operating system of the host machine allows the use 
of pre-defined commands and queries known as scripts which 
has eliminated the fluctuation of terminal time. 
Additionally, the fluctuation of the parse time has been 
eliminated by using pre-parsed commands stored in the data- 
base. However, some fluctuation is introduced by the 
query-post processor which formats data for screen display, 
but this is not significant within the query sets. 

The initial approach to defining relevant queries has 
been to concentrate on the repetition of certain operations. 
During this repetition, given factors have been varied to 
ascertain effects on performance. For the join operations, 
tuple sizes, database sizes, index structure, disk place- 
ment, and machine configuration have been varied. 

By and large, the query streams have been run in a cont- 
rolled environment. To offset the workload variability of 
the host machine, runs have been conducted during times of 
minimal activity on the host. Likewise, use of the database 
machine has been restricted to a single user. 

C. THE JCIN OPERATIONS 

Several groups of experiments have been conducted during 
which certain parameters have been varied during repetitions 
of the same experiment. These experiments have been 
designed to obtain measurements on one particular aspect of 
relational database management; the join operations. 
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1 . A Fcrtnal D<=f i nition of a Join 

Simply seated, a join is a composition of two or 
more relations. In relational algebra, a join can be 
expressed as follows: the 6-join of column x of table R and 
column y of table S is a table whose rows are in the 
Cartesian Product of R and S, such that, for the mathemat- 
ical operator 9, the row element x of R and the row element 
y of S held true for 6. 

2 . The ^in in the Ben chmarked L anguage 

In the benchmarked query language, RQL, a join is 
accomplished using the RETRIEVE, RANGE, and qualifier WHERE 
commands. For example: 
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Given the above relations, a typical join query in RQL could 
be: 

RANGE of ? is Personnel 

RANGE of D is Department 

RETRIEVE (P. last name, D. phene, D. office) 

WHERE P.dept = D.name 



This query would return; 

rrr A sTn A mf i phi inf i off i ce 
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D. EQOAIITY JOINS 



1 . The Def ini ticn and Examples 

An equality join is one in which 0 is defined as the 
mathematical egualtiy (i.e. , =) . That is, the statement 

following the qualifier WHERE in RQL contains either a 
singular cr multiple equalities. For example, using the 
relations described above, the following retrievals repre- 
sent two different equality joins: 

RET R I EVE (P. las tna me, D .phone) 

WHERE P.Dept = D.name and P.age = 

or 

RETRIEVE (P. lastname,D. phone) 

WHERE P.dept = D.name and D.name = '’OPS” 
and P.age = ”25” 

2 . The Used 

Equality joins represent the vast majority of exper- 
iments conducted during this research. Equality joins have 
been conducted on all of the databases listed in Table II. 

3 • Queries Used 

Equality joins have been run with both singular and 
multiple qualifications (i.e., singular or multiple equali- 
ties in the WHERE clause) . The majority of the joins have 
been conducted on singular qualifications, and the discus- 
sions below focus primarily on those experiments. The 
multiple-qualification joins will be discussed separately. 
The singularly-qualified joins are equated on the KEY field 
of each relation. 
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4 . Res ult s 

As prsvicusly explained, the methodology emphasizes 
varying parameters throughout the repetition of the same 
group of experiments. The gueries and their results are 
presented below, and they are grouped by the parameter that 
has been varied. 

a. Variability of Relation Size and Tuple Size 

Figure 3.1 depicts three joins of relations 
whose tuple size is of 100 bytes. The first equality join 
involves a relation of 500 tuples and another relation of 
1000 tuples. The second equality join involves a relation 
of 2500 tuples and another of 5000 tuples. The third 
equality join involves a relation of 5000 tuples and another 
of 10000 tuples. It is clearly evident that the join times 
increase linearly as the number of tuples being joined 
increases linearly. 

We now vary the tuple size for all three rela- 
tions. Thus, we benchmark the three relations whose tuple 

size is of 200 bytes. This is depicted in Figure 3.2. The 
benchmark of the relations whose tuple size is of 1000 bytes 
is depicted in Figure 3.3, and the benchmark of the rela- 
tions whose tuple size is of 2000 byres is depicted in 

Figure 3.4. The linearity demonstrated earlier in Figure 

3.1 is again evident in these joins. 

Figure 3.5 is a compilation of Figures 3.1, 3.2, 
3.3, and 3.4 in which the slopes (or the rates) of linearity 
may be compared. I t is important to note that, the bigger 
the tuple size there is; the steeper the slope will be. 
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Figure 3.1 Benchmarked Relations - Small Tuple Size. 
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Figure 3.2 Benchmarlced Helations - Medium Tuple Size. 
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Figure 3.3 Benchmarked Relations 



Large Tuple Size. 
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Figure 3.4 Benchmarked Relations - Very Large Tuple Size. 
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Figure 3.5 Relative Performance - Changes in Tuple Size. 
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b. Variability of Database Size in Terms of Number 

of B loc k s 

Figure 3.6 depicts three join operations over a 
large database of 31350 blocks, HPSTEST. Additionally, 
represented in Figure 3.6 are the same three joins over 
three small databases, NPS4 of 150 blocks, NPS5 of 750 
blocks, and NPS6 of 1500 blocks. The results of these 
queries reveal that the block size is not a significant 
factor in jcin time. 

c. Variability of Database Disk Placement 

Every database, namely, NPS1, NPS2, NPS3, NPS11, 
NPS12, or NPS13 contains only two relations. NPSI, NPS2, 
and NPS3 have been created on the same disk. NPS11, NPS12, 
and NPS13 have been created separately, each of which occu- 
pies two disks. Figure 3.7 depicts the time for joins on 
NPSI, NPS2, and NPS3 versus the same joins conducted on 
NPS11, NPS12, and NPS13. The results strongly suggest that 
database disk placement, especially for relatively small 
databases, is net a major factor in join time. 

d. Variability of Index Structure 

A query stream has been run on NPS11, NPS12, and 
NPSI 3. During the run the index structure on the relations 

in the databases has been modified from clustered to 
nonclustered and then eliminated. Figure 3.8 depicts the 
jcin times in each situation. From the results obtained, it 
can be reasonably assumed that for relations of this size 
there is no significant difference between join times on 
clustered and nonclustered indices. However, the join tiroes 
for those relations with no indices have exhibited increases 
exponentially. 
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Figure 3.6 The Impact of Database Block Size on Joins. 
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Figure 3.7 The Impact of Disk Placements on Joins. 
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Figure 3.8 The Impact of ladices on Joins. 
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6. Variability of Machine Configuration 

As stated earlier, during the course of these 
experiments, the database machine being benchmarked has 
operated under three different conif gurations ; 1/2*megabyte 
cache memory without a database accelerator, 2*megabyte 
cache memcry without a database accelerator, and 2-megabyte 
cache memcry with a database accelerator. Joins over rela- 
tions with the fixed tuple size of 100 bytes in the data- 
base, NPSTEST, have been conducted on all three 
configurations. The comparative results of these joins are 
depicted in Figure 3.9. These results show that an increase 
in cache memory size from 1/2 to 2 megabytes improved join 
time by a factor of 21 % to 31*^. The addition of the data- 
base accelerator to the 2- megabyte cache improved the join 
time by a factor of 6% to 12% only. These results would 
seam to clearly indicate that, for the join operation, a 
larger cache memory is much more effective than the addition 
of a database accelerator. 

5. S elect ion E xperime n ts 

In addition tc the equality joins described so far, 
there has been an additional qualification designed to 
select only a certain portion of the joined tuples for 
display. The number of tuples to be displayed is to b- 5 % 
of the number of tuples in the smaller relation of the two 
relations in each join. To accomplish this objective for 
the jcin cf the 500-tuple relation and the 1000-tuple rela- 
tion, the additional qualification is tc impose a ’•< 25” 
restriction on the KEY attribute. That is, the relations 
have teen joined on the equality of the KEY field in each 
relation, and there has been the additional qualifier that 
those tuples to be displayed must have a KEY value that is 
less than ”25”. For the join of the 2500-tuple relation and 
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Figure 3.9 The Impact of Machine Configurations on Joins. 
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the 5000-tuple relaticn the restriction is "< 125”, and for 
the jcin of the 5000-tupla relation and the 10000-nuple 
relation the restriction is ”< 250". 

Figure 3.10 depicts the response times for these 
join selections. Figure 3. 11 depicts the response times for 
the same joins for which there is no 5%-selection res-ric- 
tion. a comparison of the results of each join reveals that 
especially for the jcin of the larger relations the differ- 
ence in response time is proportionally greater. These 
significant differences are likely due to at least two 
prevalent factors. First of all, there is an I/O overhead 
that undoubtedly comprises a major portion of the differ- 
ence. Secondly, it is highly probable that for this type of 
join the select operation is performed first, and then the 
actual jcin is performed. A comparison of Figures 3.10 and 
3.11 would support this hypctheis, 

6 • ether Sguali t y-Joi n Experime nts 

Figure 3.12 depicts a comparison between two se*s of 
three joins on the same relations with nonclustered indices. 
The first set requires no relations to be sorted. The 
second set requires the relations to be sorted on an attri- 
bute ether than the KEY attribute on which the index is 
based. The comparative results of the runs for these joins 
are close. The plotted curves for the response times cross 
themselves. This may indicate that the sorting of relations 
on the basis of a non-key attribute doss not improve the 
jcin time. 

Figure 3.13 depicts a comparison between two sets of 
the same three joins for which the expression of the 
equality predicate has been reversed. For these particular 
joins the reversal of the expression of the equality predi- 
cate appears to be insignificant as a factor in join time. 
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Figure 3.10 Three 5X-Join Selections. 
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Three Joins Without Selection. 
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Figure 3.11 
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Figars 3.12 Joins on Sorted and CJnsorted Relations, 
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E. INEQDiLITY JOINS 



A limited number of inequality joins has been conducted 
during the course of these experimenrs. 

1 . A Defi n iti on 

Per these experiments an inequality join is one in 
which 9 is defined as a mathematical inequality. That is, 
the statement following the qualification WHERE in RQL 
contains either '•!" or '*<•• cr This qualification has 

been imposed on the KEY attribute. 

2 • Exp erime nt s 

Inequalities have been applied to the join of a 
500-tuple relation and a 1000-tuple relation and to the join 
of a 2500-tuple relation an d a 5000-tuple relation. 

3 • Dis ast r o us Results 

The results of these joins have proven to be disast- 
rous. For even the smaller join of the 500-tuple relation 
and the 1000-tu?le relation, the response time has run into 
hours. This long response time has jeopardized the 
integrity of the experiments, since during the course of the 
run the status of the host machine has experienced signifi- 
cant fluctuations in load conditrons. Obviously, it may 
prove the point that the inequality joins cannot be 
supported by the machine with any reasonable response time. 

F. THE THREE-WAY JOIN 

1 . A Defi n i ti on and Ex ample 

For these experiments a three-way join is simply a 
composition of three relations via equality joins. The 
three relations have been joined on the equality of the KEY 
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attribute of each relation. That is, for relations A, B, 
and C, the join has been accomplished WHERE A. KEY = B. KEY 
and B.KEY = C. KEY. 

2 • E xreriment s 

The three relations that have been joined are a 
500-tuple relation, a 1000-tuple relation, and a 2500-tuple 
relation. No selection restriction has been imposed or the 
join . 

The response time for this query is .8114 minutes. 
A two-way join, under similar conditions, cf the same 
500-tuple and 1000-tuple relations has been accomplished in 
.7011 minutes. The small increase of the response time from 
the two-way join to the three-way join of .1103 minutes 
(15.75?) would appear to further demonstrate the significance 
of the cne-time I/O overhead in joins. In other words, 

regardless cf th“ numter of ways a join is to be conducted, 
the cne-time I/O overhead would consume a substantial 
portion of the join time. In this case, the overhead 
consumes abcut 65% of the three-way join time. 

G. JOINS VERSOS VIEiS 

^ • liiS iii the Ben chm arked Langu age 

In RQL the CHEATS VIEW command is used to set up a 
virtual relaton which is composed of attributes cf one or 
more relations. The VIEW is not physically a relation. 
Rather, its definition is stored in the database. The 
following example creates a new virtual relation, LOCATOR: 

RANGE of P is Personnel 
RANGE of D is Department 

CREATE VIEW LOCATOR (P. name ,D. name ,D. of fice, D. phone) 

WHERE ?. dept = d.name 
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2 • Expe rt nient s on Views 

The views have been defined and stored in the appro- 
priate databases, before their use for comparison to join 
operations. For both the views and the joins, projection 
has been limited to five attributes, bur no restriction has 
been imposed on selection. 

The views have been created from a 500-tuple rela- 
tions and a 1000-tuple relation; a 2500-tuple relation and a 
5000-tuple relation; and a 5000-tuple relation and a 
10000-tuple relation. These relations exist in databases 
NPS11, NPS12, and NPS13, respectively. Likewise, the joins 
have teen accomplished on these same relations and data- 
bases . 

Figure 3.14 depicts the comparative response time 
for each of the three situations. The remarkable similarity 
in response times between views and joins for these experi- 
ments would seem to point out that the views are no more 
expensive and inefficient to use than the joins. In certain 
situations, however, the views could be of greater value, 
since they require very little disk space as compared to the 
physical space needed by the tuples of the jcins. 

Additionally, the view appears to provide the user greater 
flexibility for controlling access to the database. 
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Joins Versus Views. 
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Figure 4.1 Block access Times. 
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Figure 4.2 also reveals that for this particular data- 
base machine that it is more efficient (or profitable) to 
perform jcirs on larger relations. The access times for the 
smaller relations are much higher than the access times for 
the larger relations. As the size of the relation 

increases, the mean access time demonstrates a convergence 
to a representative number. This number, the mean access 
time, can he considered an important charact ersitic of this 
particular benchmarking experiment. 

B. A COMPARISON OF DIFFERENT ACCELERATOR/CACHE 

CONFIGORATIONS 

This benchmarking experiment has not been designed as an 
analysis of several differently configured RDM 1100s. 

However, while this benchmarking is making progress, the 
availability of more cache and the database accelerator has 
stimulated much interest in the performance differences for 
the different machine configurations. Therefore, consider- 
able time has been expended towards accumulating comparable 
data for each of the three configurations on which experi- 
ments have been run. 

In Chapter III there is a brief discussion of the 
differences in join times for the relations of 100-byte 
tuples. The following discussion focuses cn the 24 joins 
conducted on the database, NPSTEST, for each of the -^-hree 
configurations . 

Table III summarizes the average percentage decrease in 
join time for each join as the amount of cache is increased 
from 1/2 megabyte to 2 meagbytes. Table III also summarizes 
the further decrease in join time as the database acceler- 
ator is added to the 2-megabyte cache configuration. This 
summary reveals larger decreases in the join time as the 
sizes of the relations being joined increase. In ether 
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words, as the initial join overhead is absorbed, the addi- 
tional cache increasingly decreases the join time by a 
percentage of apprcximarely 59%. Correspondingly, ir 
appears that the effects of adding a database accelerator to 
the 2-megabyte cache are less significant for the larger 
relations, although in all cases there is some improvement. 

C. THE METHODOLOGY AND ITS LIMITATIONS 

The methodology that has teen discussed in this pacer 
has fundamentally sound origins, and the experimental 
approach of varying join parameters has and should continue 
to provide relevant information from which insight can be 
drawn. However, as discussed above, benchmarking is a rela- 
tively new area of research in computer science, and 
certainly the techniques that have been applied throughout 
the course of these experiments can be improved and refined. 

A definitive performance pronouncement on the HDH 1100 
has not been the ultimate goal due to the use of the GETTIME 
function of RQL. Despite its "coarseness" in getting 
performance measurements, the GSTTIME function has been 
deemed accurate enough for the purposes of our experiments. 
Actually, this function has been considered sufficiently 
accurate in view of the lack of other more accurate measure- 
ment tools. Probes have not been available, and software 

packages for performance data collection have been delayed 
and are unavailable for these experiments. Future attempts 
to benchmark such a system should utilize additional methods 
for determining relevant performance data. 

The benchmarking of the RDM 1100 is a project of seem- 
ingly low priority at the command which houses the host 
ONIVAC system. Existing workloads demand vast amount of the 
system's resources, and in reality it has been quite diffi- 
cult to "control" the environment in which these experiments 
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have fce=n ccr.ducted. Thus the load conditions of the host 
system may have compromised the integrity of some results. 
Additionally, the majority of the experiments have beer, 
conducted from a remote terminal which has probably further 
degraded the experimental results. Obviously, on site, 
strictly controlled experimentation is the ideal practice 
for benchmarking experiments. 

Our inability to control the host environment raises yet 
another issue. The goal of the experiments has been to 
collect measurements on joins for which certain parameters 
are varied. However, a major parameter has not been varied. 
That parameter is the load condition of the host. As 
described above, attempts have been made to run experiments 
at times of minimal host activity. In actual practice, the 
database machine is likely to be benchmarked during periods 
of peak host activity. Future benchmarking efforts should 
take this into consideration, and attempts should be made to 
control and vary host load conditions as part of the mix of 
query scripts. In view of the minimal host activity, the 
results we have obtained may be considered as the optimal 
performance of the RDM 1100 for join operations. 

As the deadline for submission of this thesis has drawn 
near, planned experiments have been cancelled from the 
testing agenda. A "time crunch" has resulted from a variety 
of sources. Primary of these sources has been the contin- 
uing requirements to correct software deficiencies that have 
been identified as a result of the experiments that have 
been conducted. Likewise, the changing of the database 
machine configuration has also severely cut into the time 
available to run the full set of planned experiments. In 
essence, although a great deal of relevant data has been 
collected, the consistency of some data may be questionable 
since a limited number of experiments has been conducted in 
each area of experimentation. 
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Besides these limitations and deficiencies, the experi- 
ments that have been conducted have provided enough relevaar 
information from which valuable conclusions can be drawn. 
The results of the join experiments described hare, when 
combined with those results of selection and projection 
experiments, comprise a substantial starting point for rhe 
comparison of similar database machine architectures. They 
provide a solid framework for benchmarking relational data- 
base machines. 
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